ABSTRACT



Apparatus and methods are disclosed for scheduling target 
program instructions during the code optimization pass of an 
optimizing compiler. Most modern microprocessors have 
the ability to issue multiple instructions in one clock cycle 
and/or possess multiple pipelined functional units. They also 
have the ability to add two values to form the address within 
memory load and store instructions. In such microprocessors 
this invention can, where applicable, accelerate the execu- 
tion of modulo-scheduled loops. The invention consists of a 
technique to achieve this speed up by systematically reduc- 
ing the number of certain overhead instructions in modulo 
scheduled loops. The technique involves identifying reduc- 
ible overhead instructions, scheduling the balance of the 
instructions with normal modulo scheduling procedures and 
then judiciously inserting no more than three copies of the 
reducible instructions into the schedule.

19 Claims, 11 Drawing Sheets 
METHOD AND APPARATUS FOR
INSTRUCTION SCHEDULING IN AN 
OPTIMIZING COMPILER FOR MINIMIZING 
OVERHEAD INSTRUCTIONS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to the field of Optimizing Compil- 
ers for computer systems. More specifically, the invention is 
an improved method and apparatus for scheduling target 
program instructions during the code optimization pass of an 
optimizing compiler. 

2. Background 

It is desirable that computer programs be as efficient as 
possible in their execution time and memory usage. This 
need has spawned the development of computer architec- 
tures capable of executing target program instructions in 
parallel. A recent trend in processor design is to build 
processors with increasing instruction issue capability and 
many functional units. Some examples of such designs are 
Sun's UltraSparc™ (4 issue), IBM's PowerPC™ series (2-4
issue), MIPS' R10000™ (5 issue) and Intel's Pentium-Pro™
(aka P6) (3 issue). (These processor names are the
trademarks respectively of Sun Microsystems, Inc., IBM 
Corporation, MIPS Technologies, Inc., and Intel 
Corporation). At the same time the push toward higher clock 
frequencies has resulted in deeper pipelines and longer 
instruction latencies. These and other computer processor 
architectures contain multiple functional units such as I/O 
memory ports, integer adders, floating point adders, 
multipliers, etc. which permit multiple operations to be 
executed in the same machine cycle. The process of opti- 
mizing the target program's execution speed becomes one of 
scheduling the execution of the target program instructions 
to take advantage of these multiple computing resource units 
or processing pipelines. This task of scheduling these 
instructions is performed as one function of an optimizing 
compiler. Optimizing compilers typically contain a Code 
Optimization section which sits between a compiler front 
end and a compiler back end. The Code Optimization 
section takes as input the "intermediate code" output by the 
compiler front end, and operates on this code to perform 
various transformations to it which will result in a faster and 
more efficient target program. The transformed code is 
passed to the compiler back end which then converts the 
code to a binary version for the particular machine involved 
(i.e. SPARC, X86, IBM, etc). The Code Optimization sec- 
tion itself needs to be as fast and memory efficient as it 
possibly can be and needs some indication of the computer 
resource units available and pipelining capability of the 
computer platform for which the target program code is 
written. 

For example, most code optimization sections attempt to 
optimize scheduling of the target program instructions with 
respect to the number and kind of computing resources 
available on a particular target hardware platform. Such 
computing resources include, but are not limited to, the
number of integer adders, floating point adders, the number
of the available CPU registers, the extent of the CPU
instruction pipeline and the number and kind of instruction
caches available in the target computer. This instruction
scheduling is done in an attempt to minimize the execution
delays caused by latency on necessary input data (i.e. by
instructions having to wait on necessary data to be made
available from previous instructions); to reduce the number
of instructions required for a specific calculation to the
extent possible; and to schedule instruction execution so as
to reduce the contention for available CPU registers (thereby
reducing what is known as "register spilling" in subsequent
sections of the code optimization processing). This
instruction scheduling process focuses on basic blocks in the
target program code, which could involve operating on
hundreds of instructions in the average target program being
compiled and could involve 10 to 20 thousand variables in
scientific target programs. These basic blocks typically
contain any number of loops, each of which typically contains
10-15 instructions with 40-50 variables involved. A basic
block is a sequence of consecutive statements in which flow
of control enters at the beginning of the block and leaves at
the end of the block without halt or the possibility of
branching except at the end.

In the past, attempts have been made to develop optimizing
compilers generally, and code optimizer modules
specifically, which themselves run as efficiently as possible. A
general discussion of optimizing compilers and the related
techniques used can be found in the text book "Compilers:
Principles, Techniques and Tools" by Alfred V. Aho, Ravi
Sethi and Jeffrey D. Ullman, Addison-Wesley Publishing Co
1988, ISBN 0-201-10088-6, especially chapters 9 & 10
pages 513-723. One such attempt at optimizing the
scheduling of instructions in inner-loops in computer platforms
with one or more pipelined functional units is a technique
called "modulo scheduling." Modulo scheduling is known in
the art and is generally described in the paper titled "Some
Scheduling Techniques and An Easily Schedulable
Horizontal Architecture for High Performance Scientific
Computing" by B. R. Rau and C. D. Glaeser, Proceedings of the
Fourteenth Annual Workshop on Microprogramming,
Advanced Processor Technology Group, ESL, Inc., October
1981, pages 183-198, which is incorporated fully herein by
reference. Modulo scheduling is one form of software
pipelining that extracts instruction level parallelism from inner
loops by overlapping the execution of successive iterations.
A brief summary of modulo scheduling is given in the
detailed description below.

There are many important problems that have to be
overcome when modulo scheduling is used to target modern
micro-processors and effectively compile a wide variety of
programs. For example, scheduling techniques as described
in the prior art do not attempt to systematically amortize or
reduce "loop overhead" instructions. (Loop overhead
instructions are instructions which the compiler must insert
in the executable code to load and store intermediate counter
values, to increment or decrement certain counters or array
addresses, etc.) Such prior art techniques generally rely on
architectures such as Very Long Instruction Word (VLIW)
architectures, that provide the ability to issue a large number
of instructions in one clock cycle and thereby make such
amortization unnecessary. Moreover, some machines such
as the Cydra 5 do not possess the ability to add two values
to form the address used within memory load and store
instructions, a feature which is required for effective
reduction of address computations. Most modern
microprocessors, on the other hand, do provide such a
feature. To keep the instruction fetch bandwidth requirement
low, these processors also limit the number of instructions
that can be issued together in one clock cycle. Therefore, on
these processors, if the number of loop-overhead
instructions is reduced, then a higher number of useful instructions
can be issued in the same time to perform the desired
computation faster. The invention described herein does this
systematically for modulo scheduling loops, effectively
improving machine utilization.
The present invention uses an elegant method to reduce
the number of loop overhead instructions needed in the
executable code for a loop in a target program. This
invention is contained in the scheduling section of an optimizing
compiler which uses modulo scheduling techniques, thereby
improving the execution speed of the executable code on a
target computer platform.

SUMMARY OF THE INVENTION 

The present invention overcomes the disadvantages of the
above described systems by providing an economical, high
performance, adaptable system and method for reducing the
execution time of target programs by reducing the number of
executable instructions that are required for the target
program. The present invention provides an apparatus and
method for identifying target program loop instructions
which are reducible and using only a reduced number of
copies of those instructions in the executable code.

In one aspect of the invention, the instructions are first
separated into two classes: those which are reducible and
those which are not. Then the non-reducible instructions are
modulo scheduled as before. After this scheduling step, the
reducible instructions are judiciously inserted at most once
in each of the three sections (Prologue/Kernel/Epilogue)
of the schedule. These insertions of copies are only
made in each section if there are other instructions in that
section which need a value from the copy. Then in each
section, each non-reducible instruction which uses the value
produced by the copied reducible instruction is adjusted so
as to operate properly.

In another aspect of the invention, a computer system is
disclosed which has a central processing unit (CPU) and
random access memory (RAM) coupled to said CPU, for use
in compiling a target program to run on a target computer
architecture having at least one parallel computation unit
which facilitates instruction pipelining and which provides
an ability to add at least one value to form an address used
in a memory load or store instruction and which permits two
or more instructions to be issued in a single clock cycle, the
computer system having an optimizing compiler capable of
modulo scheduling instructions for a target program,
wherein the code optimizer part of the compiler can partition
instructions for the target program into reducible
instructions and non-reducible instructions, and wherein the
modulo scheduler part of the compiler can schedule the
non-reducible instructions, and wherein the reducible
instructions can be inserted directly into the schedule of the
non-reducible instructions and wherein any non-reducible
instructions in the schedule which require use of the
reducible instructions have their original offset values adjusted as
a function of their position in the schedule and their location
in the schedule relative to the reducible instruction whose
use they require.

In yet another aspect of the invention, a method for
performing the code minimization is disclosed. And in still
a further aspect of the invention a computer program
product, embedded in a computer readable memory and
configured to perform the code optimization steps, is disclosed.

DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the system of the 
present invention will be apparent from the following 
description in which: 

FIG. 1 illustrates a portion of a computer, including a
CPU and conventional memory in which the present invention
may be embodied.

FIG. 2 illustrates a typical compiler showing the position 
of the code optimizer. 

FIG. 3 illustrates a large scale organization of a code 
optimizer. 

FIG. 4 illustrates an organization of the Instruction Sched- 
uling portion of FIG. 3 as typical in the Prior Art use of 
modulo scheduling. 

FIG. 5 illustrates a four stage seven iteration pipeline. 

FIG. 6 illustrates a flow chart of a revised modulo 
scheduling system wherein reducible instructions are iden- 
tified and scheduled separately. 

FIG. 7 illustrates in more detail a process for identifying 
reducible instructions in a loop in the target program. 

FIGS. 8-10 illustrate in more detail a process for insertion 
of identified reducible instructions in scheduled PKE code. 

FIG. 11 illustrates the relationship between a program 
loop and the associated PKE code. 

NOTATIONS AND NOMENCLATURE 

The detailed descriptions which follow are presented 
largely in terms of procedures and symbolic representations 
of operations on data bits within a computer memory. These 
procedural descriptions and representations are the means 
used by those skilled in the data processing arts to most 
effectively convey the substance of their work to others 
skilled in the art. 

A procedure is here, and generally, conceived to be a 
self-consistent sequence of steps leading to a desired result. 
These steps are those requiring physical manipulations of 
physical quantities. Usually, though not necessarily, these 
quantities take the form of electrical or magnetic signals 
capable of being stored, transferred, combined, compared, 
and otherwise manipulated. It proves convenient at times, 
principally for reasons of common usage, to refer to these 
signals as bits, values, elements, symbols, characters, terms, 
numbers, or the like. It should be understood, however, that 
all of these and similar terms are to be associated with the 
appropriate physical quantities and are merely convenient 
labels applied to these quantities. 

Further, the manipulations performed are often referred to 
in terms, such as adding or comparing, which are commonly 
associated with mental operations performed by a human 
operator. No such capability of a human operator is 
necessary, or desirable in most cases, in any of the opera- 
tions described herein which form part of the present inven- 
tion; the operations are machine operations. Useful 
machines for performing the operations of the present inven- 
tion include general purpose digital computers or similar 
devices. In all cases there should be understood the
distinction between the method operations in operating a computer
and the method of computation itself. The present invention 
relates to method steps for operating a computer in process- 
ing electrical or other (e.g., mechanical, chemical) physical 
signals to generate other desired physical signals. 

The present invention also relates to apparatus for per- 
forming these operations. This apparatus may be specially 
constructed for the required purposes or it may comprise a 
general purpose computer as selectively activated or recon- 
figured by a computer program stored in the computer. The 
procedures presented herein are not inherently related to a 
particular computer or other apparatus. In particular, various 
general purpose machines may be used with programs 
written in accordance with the teachings herein, or it may 
prove more convenient to construct more specialized appa- 
ratus to perform the required method steps. The required 
structure for a variety of these machines will appear from the
description given.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Apparatus and methods are disclosed for scheduling target
program instructions during the code optimization pass
of an optimizing compiler. Most modern microprocessors
have the ability to issue multiple instructions in one clock
cycle and/or possess multiple pipelined functional units.
They also have the ability to add two values to form the
address within memory load and store instructions. In such
microprocessors this invention can, where applicable,
accelerate the execution of modulo-scheduled loops. The
invention consists of a technique to achieve this speed up by
systematically scheduling certain overhead instructions in
modulo scheduled loops. The disclosed invention reduces
the number of loop overhead instructions needed in the
instruction schedule for processing a loop in the target
program on a computer platform that permits instruction
pipelining. In the following description, for purposes of
explanation, specific instruction calls, modules, etc., are set
forth in order to provide a thorough understanding of the
present invention. However, it will be apparent to one skilled
in the art that the present invention may be practiced without
these specific details. In other instances, well known circuits
and devices are shown in block diagram form in order not to
obscure the present invention unnecessarily. Similarly, in the
preferred embodiment, use is made of uni-processor and
multi-processor computer systems as well as the SOLARIS
operating system, including specifically the SUN
ULTRASPARC processor and the SUN SPARC compiler version 4.0,
all of which are made and sold by Sun Microsystems, Inc.,
the assignee of this present invention. However the present
invention may be practiced on other computer hardware
systems and using other operating systems.

Operating Environment

The environment in which the present invention is used
encompasses the general distributed computing system,
wherein general purpose computers, workstations, or
personal computers are connected via communication links of
various types, in a client-server arrangement, wherein
programs and data, many in the form of objects, are made
available by various members of the system for execution
and access by other members of the system. Some of the
elements of a general purpose workstation computer are
shown in FIG. 1, wherein a processor 1 is shown, having an
Input/output ("I/O") section 2, a central processing unit
("CPU") 3 and a memory section 4. The I/O section 2 is
connected to a keyboard 5, a display unit 6, a disk storage
unit 9 and a CD-ROM drive unit 7. The CD-ROM unit 7 can
read a CD-ROM medium 8 which typically contains
programs 10 and data.

FIG. 2 illustrates a typical optimizing compiler 20,
comprising a front end compiler 24, a code optimizer 26 and a
back end code generator 28. The front end 24 of a compiler
takes as input a program written in a source language 22 and
performs various lexical, syntactical and semantic analysis
on this language outputting an intermediate set of code 32
representing the target program. This intermediate code 32
is used as input to the code optimizer 26 which attempts to
improve the intermediate code so that faster-running
machine (binary) code 30 will result. Some code optimizers
26 are trivial and others do a variety of computations in an
attempt to produce the most efficient target program
possible. Those of the latter type are called "optimizing
compilers" and include such code transformations as common
sub-expression elimination, dead-code elimination,
renaming of temporary variables and interchange of two
independent adjacent statements as well as register
allocation.

FIG. 3 depicts a typical organization of an optimizing
compiler 40. On entry of the intermediate code 42 a Control
Flow Graph is constructed 44. At this stage the
aforementioned code transformations (common sub-expression
elimination, dead-code elimination, renaming of temporary
variables and interchange of two independent adjacent
statements, etc.) take place 46. Next instruction scheduling
or "pipelining" may take place 48 at this point. Then
"register allocation" is performed 50 and the modified code
is written out 52 for the compiler back end to convert it to
the binary language of the target machine (i.e. SPARC, X86,
etc.). It is this "Instruction Scheduling" 48 process which is
the focus of the applicants' invention.

Instruction Scheduling

Referring now to FIG. 4, a general flow chart of the prior
art Optimizing Compiler Modulo Scheduling operation is
depicted 100. Upon entry to this section of the Optimizing
Compiler 102 incoming intermediate data is processed and
the data representing a loop is used to construct a Data
Dependency Graph (DDG) 104. Using this DDG the
scheduler determines a theoretical maximum throughput possible
for this loop, given all the data dependencies and the
resource requirements 106. That is, considering the data
dependencies of each instruction and the resource
requirements (such as a memory port, integer add unit, floating
point unit, etc.) a calculation is made to determine the
minimum iteration interval (mii). Next all instructions in the
loop are scheduled obeying the modulo constraint 108. The
output of the scheduling pass 108 is a schedule in PKE
format 110, and the scheduling process for the loop is
completed 112.

Brief Summary of Modulo Scheduling

Modulo scheduling has been described in the literature as
indicated above. Nevertheless it is helpful at this point to
summarize the process for completeness. The key principles
are as follows. Parallel instruction processing is obtained by
starting an iteration before the previous iteration has
completed. The basic idea is to initiate new iterations after fixed
time intervals. This time interval is called the initiation
interval or the iteration interval (II). FIG. 5 shows the
execution of seven iterations of a pipelined loop. Let the
scheduled length of a single iteration be TL 138 and let it be
divided into stages each of length II 126. The stage count,
SC, is defined as SC=⌈TL/II⌉, or in this case TL=4 (138 in
FIG. 5) and II=1 (126), and so SC=⌈4/1⌉=4. Loop execution
begins with stage 0 140 of the first iteration 128. During the
first II cycles, no other iteration executes concurrently. After
the first II cycles, the first iteration 128 enters stage 1 and the
second iteration 142 enters stage 0.

New iterations join every II cycles until a state is reached
when all stages of different iterations are executing. Toward
the end of loop execution no new iterations are initiated and
those that are in various stages of progress gradually
complete.

These three phases of loop execution are termed the
prologue 130, the kernel 132 and the epilogue 134. During
the prologue 130 and the epilogue 134 not all stages of
successive iterations execute. This happens only during the
kernel phase 132. The prologue 130 and the epilogue 134
last for (SC-1)*II cycles. If the trip count of the loop is large
(that is, if the loop is of the type where say 10 iterations of
the loop are required), the kernel phase 132 will last much
longer than the prologue 130 or the epilogue 134. The
primary performance metric for a modulo scheduled loop is
the initiation interval, II 126. It is a measure of the steady
state throughput for loop iterations. Smaller II values imply
higher throughput. Therefore, the scheduler attempts to
derive a schedule that minimizes the II. The time to execute
n iterations is T(n)=(n+SC-1)×II. The average time per
iteration, T(n)/n, approaches II as n approaches infinity.
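
The timing relations above can be checked with a short calculation. The following Python sketch is illustrative only; the function names `stage_count` and `total_cycles` are assumptions introduced here and are not part of the disclosure.

```python
def stage_count(tl: int, ii: int) -> int:
    """Stage count SC = ceil(TL / II), using integer arithmetic."""
    return -(-tl // ii)


def total_cycles(n: int, tl: int, ii: int) -> int:
    """Cycles to execute n iterations: T(n) = (n + SC - 1) * II."""
    return (n + stage_count(tl, ii) - 1) * ii


# FIG. 5 example: TL = 4, II = 1, seven iterations.
sc = stage_count(4, 1)
prologue_len = (sc - 1) * 1          # prologue (and epilogue) length in cycles
t7 = total_cycles(7, 4, 1)

# For a large trip count the cost per iteration tends toward II.
per_iteration = total_cycles(10_000, 4, 1) / 10_000
```

With a small trip count the prologue and epilogue are a noticeable share of T(n); with a large one they are amortized and the per-iteration cost is close to II.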

Scheduling proceeds as follows. The data dependence
graph (DDG) for the loop is constructed. Nodes in this
(directed) graph correspond to instructions, and arcs to
dependences between them. Arcs possess two attributes:
latency and omega. Latency is the number of clocks of
separation required between the source and the destination,
and omega is the iteration distance between the two. (That
is, if the source instruction calculates a value for the
destination instruction which is to be used in the next iteration,
the omega value would be 1. If the value were to be used two
iterations after it was calculated omega would be 2, etc.).
Prior to scheduling, two bounds on the maximum
throughput, the MII and the RMII, are derived. The MII is
a bound on the minimum number of cycles needed to
complete one iteration and is based only on processor
resources. For example, if a loop has 10 add operations and
the processor can execute at most two adds per clock, then
the add unit resource would limit the iteration throughput to
at most one every five clocks. The MII is computed by
taking each resource in turn and then taking the maximum
of the bounds imposed by each. The RMII is a bound
on the minimum number of clocks needed to complete one
iteration and is based only on dependences between nodes.
Cycles in the DDG imply that a value x_j computed in some
iteration i is used in a future iteration j and is needed to
compute the similarly propagated value in iteration j. These
circular dependences place a limit on how rapidly iterations
can execute because computing the values needed in the
cycle takes time. For each elementary cycle in the DDG, the
ratio of the sum of the latencies (l) to the sum of the omegas
(d) is computed. This value limits the iteration throughput
because it takes l clocks to compute values in a cycle that
spans d iterations.
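
The two bounds can be sketched as follows. This is a minimal illustration under stated assumptions, not the compiler's implementation: the function names, the dictionary-based resource description, and the representation of each elementary cycle as a list of (latency, omega) arcs are all hypothetical, and the enumeration of elementary cycles in the DDG is assumed to have been done elsewhere.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def resource_mii(uses: dict[str, int], copies: dict[str, int]) -> int:
    """MII: maximum over resources of ceil(uses / copies).

    `uses[r]` is how many operations per iteration need resource r and
    `copies[r]` is how many units of r the processor provides.
    """
    return max((ceil_div(uses[r], copies[r]) for r in uses), default=1)


def recurrence_mii(cycles: list[list[tuple[int, int]]]) -> int:
    """RMII: maximum over elementary cycles of ceil(sum(l) / sum(d)).

    Each cycle is a list of (latency, omega) arcs. A well-formed cycle
    has a total omega of at least 1. With no cycles, 1 is returned
    because an initiation interval cannot be smaller than one clock.
    """
    return max(
        (ceil_div(sum(l for l, _ in c), sum(d for _, d in c)) for c in cycles),
        default=1,
    )


# The example from the text: 10 adds, two add units -> one iteration per 5 clocks.
mii = resource_mii({"add": 10, "mem": 3}, {"add": 2, "mem": 1})

# A two-arc recurrence: latencies 3 and 1, carried across one iteration.
rmii = recurrence_mii([[(3, 0), (1, 1)]])
```

Rounding each ratio up reflects that the initiation interval is a whole number of clocks.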

The fixed spacing between overlapped iterations forces a
constraint on the scheduler other than the normal constraints
imposed by the arcs in the DDG. Note that placing an
operation at a time t implies that there exists a corresponding
operation in the kth future iteration at (t+k*II). Operations
using the same resource must be placed at different times,
modulo the II. This is referred to as the "modulo constraint".
It states that if an operation uses a resource at time t_i and
another operation uses exactly the same resource at time t_j,
then t_i and t_j must satisfy "t_i modulo II is not equal to
t_j modulo II". The scheduler begins by attempting to derive
a schedule using II=max(MII, RMII). If a schedule is not
found, the II is incremented. The process repeats until a
schedule is found or an upper limit is reached. After
scheduling, the kernel has to be unrolled and definitions
renamed to prevent values from successive iterations from
overwriting each other. "Unrolling the kernel" is defined as
creating multiple copies of the kernel in the generated code.
The minimum kernel unroll factor (KUF) needed is
determined by the longest value lifetime divided by the II because
corresponding new lifetimes begin every II clocks. (The
"lifetime" of a value is equal to the time for which a value
exists; i.e. from the time its generation is started until the last
time instant when it is or could be used.). Remainder
iterations (up to KUF-1) use a cleanup loop.
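
The modulo constraint and the search over II can be illustrated with a deliberately simplified sketch. It is not a full modulo scheduler: it assumes one copy of each resource and ignores the latency and omega arcs of the DDG entirely, so it shows only how slots are reserved modulo II and how II is incremented when placement fails. All names here are hypothetical.

```python
from typing import Optional


def place_ops(ops: list[tuple[str, str]], ii: int) -> Optional[dict[str, int]]:
    """Greedy placement that honors only the modulo constraint.

    `ops` is a list of (operation name, resource name) pairs. Two
    operations on the same resource may not share a time slot modulo II.
    Returns {name: time}, or None if some operation finds no free slot.
    """
    busy: dict[str, set[int]] = {}      # resource -> reserved slots (time mod II)
    times: dict[str, int] = {}
    for name, res in ops:
        slots = busy.setdefault(res, set())
        slot = next((s for s in range(ii) if s not in slots), None)
        if slot is None:
            return None                 # every slot modulo II is taken
        slots.add(slot)
        times[name] = slot
    return times


def find_schedule(
    ops: list[tuple[str, str]], start_ii: int, max_ii: int
) -> Optional[tuple[int, dict[str, int]]]:
    """Try II = start_ii, start_ii + 1, ... until placement succeeds."""
    for ii in range(start_ii, max_ii + 1):
        times = place_ops(ops, ii)
        if times is not None:
            return ii, times
    return None                         # upper limit reached


def kernel_unroll_factor(longest_lifetime: int, ii: int) -> int:
    """Minimum KUF = ceil(longest value lifetime / II)."""
    return -(-longest_lifetime // ii)


ops = [("ld1", "mem"), ("ld2", "mem"), ("add", "alu")]
result = find_schedule(ops, start_ii=1, max_ii=4)
kuf = kernel_unroll_factor(longest_lifetime=5, ii=2)
```

In this example two loads compete for a single memory port, so no placement exists at II=1 and the search settles on II=2.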
The Invention: Modified Modulo Scheduling
The basic idea of the invention is shown in FIG. 6. As
before, on entry 152, a DDG is constructed for the next loop
to be scheduled 154. Then the set of instructions in the input
DDG is partitioned into two, wherein the reducible
(loop-overhead) instructions are identified 156. One part of the
partition contains the non-reducible instructions and the
other part contains the reducible instructions. The former is
modulo scheduled 158, 160 and then, after a schedule has
been derived, the reducible instructions are introduced into
the previously derived schedule 162. The key difference
from previous approaches is that the two parts are not
modulo scheduled together. By recognizing and eliminating
the reducible instructions from first class consideration,
attention can be devoted exclusively to the more useful
non-reducible instructions. However, reducible instructions
are necessary and allowance has to be made to reintroduce
them into the schedule later. The invention permits this by
judiciously adjusting certain parameters when the
non-reducible instructions are scheduled. Some important steps
are required for this mechanism to work. These are
described in detail below.
Identifying Reducible Instructions 

A loop instruction is reducible if multiple iterations (more
than one) of the loop can be executed without having to
execute the reducible instruction in every iteration. In
general, an instruction is reducible if a mechanism can be
found whereby instructions that use its result can be
modified such that the modified versions do not require the
previous reducible instruction to be executed. That is, an
instruction, y=f(x1, x2, . . . ) which feeds another instruction
z=g(y, u1, u2, . . . ) is reducible if the latter instruction can
be modified or rewritten to directly compute z=g'(x1, x2, . . . ,
u1, u2, . . . ). In this case y may be said to be reducible in the
first degree with respect to z. Similarly, higher degrees of
reducibility could be defined. The minimum of the degree of
reducibility of an instruction with respect to all its uses is the
limiting factor in determining the reducibility of an
instruction. In the preferred embodiment it is assumed that
reducible means reducible to an unbounded degree (unbounded
within the limits of the computer representation of data which
is in reality finite) with respect to all uses.
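
The rewrite of g into g' can be made concrete with a small model. The sketch below is an illustration under stated assumptions: memory is modeled as a Python list, `load` stands in for a load instruction whose address is a base plus an immediate displacement, and all names are hypothetical rather than taken from the compiler.

```python
# Hypothetical memory image: memory[a] holds the value 100 + a.
memory = list(range(100, 200))


def load(base: int, disp: int) -> int:
    """Model of a load whose address is base + immediate displacement."""
    return memory[base + disp]


def original(x: int) -> int:
    y = x + 4                 # y = f(x): the reducible increment
    return load(y, 8)         # z = g(y, ...): depends on y


def rewritten(x: int) -> int:
    return load(x, 4 + 8)     # z = g'(x, ...): increment folded into displacement


def sum_plain(base: int, n: int) -> int:
    """Address increment executed in every iteration."""
    total, p = 0, base
    for _ in range(n):
        total += load(p, 0)
        p += 1
    return total


def sum_reduced(base: int, n: int, k: int = 3) -> int:
    """One increment serves k iterations; assumes n is a multiple of k."""
    total, p = 0, base
    for _ in range(n // k):
        for i in range(k):            # k copies of the loop body
            total += load(p, i)       # displacement i replaces i increments
        p += k                        # single increment for k iterations
    return total
```

Both pairs of functions compute the same values; the second member of each pair simply executes fewer increment instructions, which is the saving that reducibility makes available.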
In the preferred embodiment of the present invention, the
Sun Microsystems, Inc. SC 4.0 compiler developed for the
UltraSparc processor currently uses the following criteria for
identifying and tagging reducible instructions:

If the instruction is an integral self-increment or
self-decrement of an induction variable, and

If the induction variable is incremented or decremented
by a compile-time known integer constant, and

If all uses of the instruction can be modified to have an
immediate displacement representing the computed result of
the reducible instruction, (Note that for the UltraSparc
instruction set the address portion of memory operations
allows for an immediate displacement),

OR, if the instruction is the loop exit test or the loop back
branch, then, the instruction is tagged as a reducible
instruction.

Examples of reducible instructions identified by the SC
4.0 compiler include array address increments that feed
memory operations (loads and stores), the loop control
increment instruction, the loop exit test instruction and the
loop back branch.
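
The tagging criteria listed above can be sketched as a simple predicate. This is a hedged illustration only: the `Instr` record and its boolean fields are hypothetical stand-ins for whatever analysis results the compiler actually holds for each DDG node, and the predicate follows the criteria as listed rather than any particular flow-chart ordering.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical per-instruction facts assumed to be known from earlier analysis."""
    name: str
    is_self_inc_of_induction_var: bool = False
    step_is_compile_time_const: bool = False
    all_uses_accept_displacement: bool = False
    is_exit_test_or_back_branch: bool = False


def is_reducible(ins: Instr) -> bool:
    """Apply the three increment criteria, OR the loop-control criterion."""
    if (
        ins.is_self_inc_of_induction_var
        and ins.step_is_compile_time_const
        and ins.all_uses_accept_displacement
    ):
        return True
    return ins.is_exit_test_or_back_branch


def partition(instrs: list[Instr]) -> tuple[list[Instr], list[Instr]]:
    """Split a loop body into (reducible, non-reducible) instructions."""
    reducible = [i for i in instrs if is_reducible(i)]
    non_reducible = [i for i in instrs if not is_reducible(i)]
    return reducible, non_reducible


loop_body = [
    Instr("ld [p]"),
    Instr("fadd"),
    Instr("add p, 8, p", True, True, True),
    Instr("add q, r, q", True, False, True),   # step not a compile-time constant
    Instr("cmp/branch", is_exit_test_or_back_branch=True),
]
reducible, non_reducible = partition(loop_body)
```

Only the non-reducible part would then be handed to the modulo scheduler, with the reducible instructions reintroduced afterwards.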

Referring now to FIG. 7 the procedure for identifying
reducible instructions in the preferred embodiment is
depicted 200. (This description covers in more detail the step
identified above as block 156 in FIG. 6). On entry to the
block 202, if there are no instructions to be tested 204, 206
the step is completed 208. If there are instructions to be
tested 210 the next instruction in the DDG is obtained 212.
If the instruction is not an integral self-increment or
self-decrement of an induction variable 216 it is considered
non-reducible and the routine returns to block 204 via block
B 234. If the instruction is an integral self-increment or
self-decrement of an induction variable 218 the instruction
is tested to see if the induction variable is incremented or
decremented by a compile-time known integer constant 220.
If not 222 this instruction does not qualify as reducible and
the routine returns to block 204. If the instruction does pass
this test 224 a test is made to determine if all uses of this
instruction can be modified to have an immediate
displacement representing the computed result of the instruction
226. If so 230 this instruction is tagged as a reducible
instruction 232 and the routine returns to block 204 via block
B 234. If not 228, then it is tested to see if the instruction is
the loop exit or loop back branch instruction 238. If so 240,
this instruction is tagged as a reducible instruction 232 and
the routine returns to block 204 via block B 234. If not 242
the instruction is deemed to be non-reducible and the routine
returns to block 204. After all instructions (nodes) in the
DDG for the loop have been checked, any reducible
loop-overhead instructions have been tagged and will not be
modulo scheduled with the non-reducible ones but rather
will be inserted into the schedule as described in more detail
as follows. It should be realized by those skilled in the art
that while these specific test criteria for identifying reducible
instructions are used in the preferred embodiment, various
other criteria for identifying reducible instructions may be
used and should be considered to be within the bounds of the
invention claimed herein.

The Preferred Embodiment in Further Detail

The following describes the preferred embodiment in
additional detail. After partitioning the instructions into a
reducible set and a non-reducible set (call this Step 1), the
following steps are performed.

Step 2 Compute the resource requirements for the
reducible instructions

For each resource in the machine model compute the total
number of time units for which that resource is used by the
reducible instructions. For example, if there are 6 reducible
address add instructions and each of them uses the adder
resource for two time units, then the total resource
requirement for this resource is 12.

Step 3 Compute the resource requirements for the
non-reducible instructions

Perform the above step for the non-reducible instructions.

Step 4 Compute the value "mii_without"

Modulo scheduling attempts to execute loop iterations at
the fastest possible rate. Before a schedule is attempted, an
upper bound (aggressive estimate) on this rate is calculated
(designated "mii"). The scheduler attempts to achieve this
target rate; if it fails, the goal is relaxed and a new schedule
is attempted.

In this step, such a target is computed ignoring the
reducible instructions (designated "mii_without"). For
example, if there are 6 non-reducible multiply instructions in
a loop and the machine can execute at most 2 multiplies in
one time unit, then each loop iteration will take at least
mii_without = 3 time units. Using the resource requirements
for the non-reducible instructions and knowing the number
of copies available of each resource, this target can be
calculated by taking the maximum over all resources of the
ratio of the resource requirements to the resource copies.

Step 5 Conditionally increment mii_without

If the value of mii_without computed above is such that
there exists an identified reducible instruction requiring a
resource, copies of which are entirely consumed by
non-reducible instructions, then increase mii_without by 1.
This must be done because if it were not, and the loop were
scheduled at the rate computed in step 4, then there would
be no room available for the reducible instructions. Note that
this is not the same as accommodating the reducible
instructions in step 4. In fact, it is because the instructions are
reducible that an increment by 1 [...] times, each reducible
instruction can be placed once in the unrolled kernel. For
many loops, this increment of mii_without may not be
required as there may already be enough space to
accommodate the reducible instructions.

Step 6 Obtain the value for "mii" taking both resource and
recurrence constraints

Find the upper bound on the throughput achievable by
obtaining the maximum of the mii_without and the rmii
(obtained by considering the longest recurrence cycle in the
loop graph). This value is used in the next step wherein
modulo scheduling of the non-reducible instructions
attempts to attain a throughput as close to this maximum
value as possible.

Step 7 Derive a Modulo schedule for the non-reducible
instructions

In this step, a modulo schedule is derived for the
non-reducible instructions of the loop. Let the derived schedule
correspond to an execution rate of one iteration every II
clock cycles.

Step 8 Compute MKUF (Minimum Kernel Unroll Factor)
to Accommodate the Reducible Instructions

In this step, a lower bound is computed for the "kernel
unroll factor" (kuf). The invention is aggressive in
scheduling just the non-reducible instructions. After they have
been scheduled, room must be made for the reducible
instructions. The number of existing empty slots in the
schedule derived for the non-reducible instructions and the
number of slots required for the reducible instructions
together determine the minimum value imposed on "kuf."
For example, if the reducible instructions require 6 slots of
a resource and two empty slots are available in one copy of
the kernel after the non-reducible instructions have been
scheduled, then the kernel must be unrolled at least three
times (three copies of the kernel are needed) to
accommodate the reducible instructions. The value must be calculated
for each resource used by some reducible instruction and the
maximum value chosen. Computationally,

mkuf = max_i ceil(r_i / n_i)

where i denotes a resource, r_i is the number of copies of this
resource used by the reducible instructions, n_i is the number
of copies of this resource available in one copy of the kernel
after the non-reducible instructions have been scheduled and
the "max" is taken over all resources i such that r_i is greater
than 0. After the lower bound, "mkuf", is determined, the
"kuf" is set to be the maximum of itself and this bound. That
is,

kuf = Max(kuf, mkuf)
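The arithmetic of Steps 2 through 8 can be sketched as follows. This is a minimal sketch, not the compiler's implementation: the function names and the dictionary-based resource model are assumptions introduced here, and Step 5 assumes that a resource with `copies` units offers `copies * mii_without` slots per kernel copy.

```python
from typing import Dict


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return (a + b - 1) // b


def resource_mii(required: Dict[str, int], copies: Dict[str, int]) -> int:
    """Step 4: maximum over resources of ceil(requirement / copies)."""
    return max(ceil_div(required[r], copies[r]) for r in required)


def conditionally_increment(mii_without: int,
                            nonred_req: Dict[str, int],
                            red_req: Dict[str, int],
                            copies: Dict[str, int]) -> int:
    """Step 5: add 1 if a resource needed by a reducible instruction
    is entirely consumed by the non-reducible instructions at this rate."""
    for r, need in red_req.items():
        free = copies[r] * mii_without - nonred_req.get(r, 0)
        if need > 0 and free <= 0:
            return mii_without + 1
    return mii_without


def combined_mii(mii_without: int, rmii: int) -> int:
    """Step 6: the larger of the resource bound and the recurrence bound."""
    return max(mii_without, rmii)


def min_kernel_unroll_factor(red_req: Dict[str, int],
                             free_slots: Dict[str, int]) -> int:
    """Step 8: mkuf = max over i with r_i > 0 of ceil(r_i / n_i)."""
    worst = 1
    for r, need in red_req.items():
        if need <= 0:
            continue
        if free_slots.get(r, 0) <= 0:
            raise ValueError(f"no free slot for resource {r!r}")
        worst = max(worst, ceil_div(need, free_slots[r]))
    return worst


# The two examples from the text.
mii_without = resource_mii({"mul": 6}, {"mul": 2})        # 6 multiplies, 2 per time unit
mkuf = min_kernel_unroll_factor({"add": 6}, {"add": 2})   # 6 slots needed, 2 free per copy
kuf = max(1, mkuf)                                        # kuf = Max(kuf, mkuf), starting from kuf = 1
```

Raising an error when a needed resource has no free slot is a design choice of this sketch; in the described procedure Step 5 is what is meant to prevent that situation.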

Step 9 Generate Code and insert the reducible instructions 
in the schedule 

Now the Prologue/Kernel/Epilogue (PKE) code for the
loop is derived by repeating the modulo schedule obtained
for the non-reducible instructions every II cycles for a total
of N times, where

N = StageCount - 1 + KUF

StageCount in this formula is defined in the prior art and
refers to the number of conceptual stages in the software
pipeline of the modulo scheduled loop.
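As a small illustration of the replication count, the sketch below lists the N copies of the schedule together with the cycle at which each copy starts, assuming copy 1 starts at cycle 0. The function name and the sample values of StageCount, KUF and II are hypothetical.

```python
from typing import List, Tuple


def pke_copies(stage_count: int, kuf: int, ii: int) -> List[Tuple[int, int]]:
    """Return (copy_number, start_cycle) for the N = StageCount - 1 + KUF
    copies of the modulo schedule, one copy issued every II cycles."""
    n = stage_count - 1 + kuf
    return [(copy, (copy - 1) * ii) for copy in range(1, n + 1)]


# Illustrative values only: 3 stages, kernel unrolled 5 times, II of 3 cycles.
copies = pke_copies(stage_count=3, kuf=5, ii=3)
```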

The Prologue and the Epilogue regions represent the fill 
and the drain regions of the pipeline respectively. The 
Kernel is the steady state region of the pipeline where 
iteration is performed. Scheduling of each reducible instruc- 
tion is performed as follows in the preferred embodiment 
using the SC 4.0 compiler: 

1) Find "MaxUseInPrologue" as follows: Let the result
computed by a reducible instruction be used by some other
instructions, say, I_1, I_2, I_3, ..., I_k. Let C_1, C_2, C_3, ..., C_k
be the number of copies of I_1, I_2, I_3, ..., I_k respectively
placed in the prologue. Then

MaxUseInPrologue = max(C_1, C_2, ..., C_k)

2) Place the reducible instruction in the kernel and adjust 
the displacements on its uses: 

Find the first available slot in the kernel, Ti, counting from
the end of the kernel, and schedule the instruction. Change
the increment on the instruction to be
"OriginalIncrement * KUF".

For the uses scheduled before Ti replace the original
displacements by adding the following term:

(CopyNumberOfNodeInKernel - Omega +
TotalCopiesOfNodeInPrologue - MaxUseInPrologue)
* OriginalDisplacement.

For the uses scheduled after Ti replace the original
displacements by subtracting the term:

(KUF - CopyNumberOfNodeInKernel + Omega +
MaxUseInPrologue - TotalCopiesOfNodeInPrologue)
* OriginalDisplacement.

where "CopyNumberOfNodeInKernel" ranges from 1 to
Kernel Unroll Factor (kuf),

"Omega" is the dependence distance on the arc from the
reducible instruction to the use,

"TotalCopiesOfNodeInPrologue" is the total number of copies
(if any) of this use placed in the Prologue, and

"OriginalDisplacement" is the original displacement on
the use.

3) For each reducible instruction that has at least one use 
in the Prologue, place a copy of the instruction in the 
prologue and adjust displacements as follows: 

Find the first available slot in the prologue, Ti, and 
schedule the instruction. If there is no space available in the 
Prologue, simply prepend the instruction to the Prologue. 

For the uses in the prologue scheduled before Ti replace the
original displacements by adding the following term:

(CopyNumberOfNodeInPrologue - Omega)
* OriginalDisplacement

For the uses scheduled after Ti replace the original
displacements by subtracting the term:

(Omega + MaxUseInPrologue - CopyNumberOfNodeInPrologue)
* OriginalDisplacement

4) For each reducible instruction that has at least one use
in the Epilogue, place a copy of the instruction in the
epilogue and adjust displacements as follows:

Find the first available slot in the epilogue, Ti, and
schedule the instruction. If there are no empty slots
available, append it to the epilogue. Set
Ti = (StageCount - 1 + KUF) * II + 1.

For the uses in the epilogue scheduled before Ti replace the
original displacements by adding the following term:

(CopyNumberOfNodeInEpilogue - Omega +
TotalCopiesOfNodeInPrologue - MaxUseInPrologue)
* OriginalDisplacement.

For the uses in the epilogue scheduled after Ti, replace the
original displacements by subtracting the term:

(StageCount - 1 - CopyNumberOfNodeInEpilogue +
Omega - TotalCopiesOfNodeInPrologue)
* OriginalDisplacement.
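The adjustment terms for the kernel, prologue and epilogue can be written out as plain functions, which makes one property easy to check: for the same use copy, the term added before Ti plus the term subtracted after Ti is a constant for each region. The sketch below is illustrative only. The function and parameter names are assumptions, and the minus sign in front of TotalCopiesOfNodeInPrologue in the epilogue "after" term is treated as an assumption, chosen because it makes the two epilogue terms sum to (StageCount - 1 - MaxUseInPrologue) * OriginalDisplacement, the value claim 12 gives for the epilogue copy.

```python
from typing import Tuple


def kernel_terms(copy: int, omega: int, total_in_prologue: int,
                 max_use_in_prologue: int, kuf: int, disp: int) -> Tuple[int, int]:
    """Return (term added for uses before Ti, term subtracted for uses after Ti)."""
    before = (copy - omega + total_in_prologue - max_use_in_prologue) * disp
    after = (kuf - copy + omega + max_use_in_prologue - total_in_prologue) * disp
    return before, after


def prologue_terms(copy: int, omega: int,
                   max_use_in_prologue: int, disp: int) -> Tuple[int, int]:
    """Same pair of terms for uses placed in the prologue."""
    before = (copy - omega) * disp
    after = (omega + max_use_in_prologue - copy) * disp
    return before, after


def epilogue_terms(copy: int, omega: int, total_in_prologue: int,
                   max_use_in_prologue: int, stage_count: int,
                   disp: int) -> Tuple[int, int]:
    """Same pair of terms for uses placed in the epilogue (sign assumption noted above)."""
    before = (copy - omega + total_in_prologue - max_use_in_prologue) * disp
    after = (stage_count - 1 - copy + omega - total_in_prologue) * disp
    return before, after


# Arbitrary illustrative values.
kb, ka = kernel_terms(copy=2, omega=1, total_in_prologue=3,
                      max_use_in_prologue=4, kuf=5, disp=8)
pb, pa = prologue_terms(copy=2, omega=1, max_use_in_prologue=4, disp=8)
eb, ea = epilogue_terms(copy=2, omega=1, total_in_prologue=3,
                        max_use_in_prologue=4, stage_count=6, disp=8)
```

With these values the kernel pair sums to KUF * 8, the prologue pair to MaxUseInPrologue * 8, and the epilogue pair to (StageCount - 1 - MaxUseInPrologue) * 8, independent of the copy number and of Omega.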

The following simple example illustrates the key points
described above. The details of the scheduling process, the
reservation tables and the final adjustments to the
instructions and code are not described in this example; only
the key concepts are. Consider a simple machine that is capable
of issuing up to two instructions per clock cycle, one
memory and one or two computational instructions.
Now consider a loop containing the following
instructions that is to be scheduled for this machine. Assume
further that the loop has been examined and that the
instructions marked with an asterisk have been recognized as
reducible (Step 1).
load
fmul
load
fadd
store
add*
add*
add*
cmp*
ble*

The first partition, the non-reducible instructions, requires 3
memory and 2 computational slots. The second partition, the
reducible instructions, requires 5 computational slots.

Examining the first partition indicates that three clocks are
required to issue the instructions in this partition (see Table
1 below). This is so because there are three memory
instructions and only one can be issued in a clock cycle. That is,
mii_without = 3.

Now examining the resource requirements of the reducible
instructions and the empty slot availability in Table 1
shows that one empty slot is available for the reducible
instructions. Therefore, mii_without need not be
incremented (Step 5).

TABLE 1

Table of non-reducible instructions

Memory slot    Computational slot
load           fmul
load           fadd
store          EMPTY

The loop is then modulo scheduled with mii_without = 3
(Step 6). For the sake of simplicity, assume that the result of
this schedule is represented just the same as in Table 1.
Now compute the mkuf for the loop as follows: Since
there is only one empty slot in one copy of the kernel, and
there are five reducible instructions to be placed, the kernel
must be unrolled at least five times, i.e., mkuf = 5. Assume that
after this bound is placed on kuf, the value of kuf is 5. In the
final step, the reducible operations are placed into the empty
slots of the five copies of the kernel and the displacements
etc. are adjusted suitably to preserve program correctness.
When the kernel is unrolled, a cleanup loop is required to
execute the remainder iterations. Such issues are not
discussed here because they are general and do not pertain
specifically to the invention described here.
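The example can be replayed mechanically. The sketch below models Table 1 as three rows of (memory slot, computational slot), counts the empty computational slots, derives mkuf, and drops the five reducible instructions into the empty slots of the unrolled kernel. It is only an illustration: the data layout and the front-to-back fill order are assumptions made here, and displacement adjustment and the cleanup loop are left out.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[str]]  # (memory slot, computational slot); None means EMPTY

KERNEL: List[Row] = [("load", "fmul"), ("load", "fadd"), ("store", None)]
REDUCIBLE: List[str] = ["add", "add", "add", "cmp", "ble"]


def empty_compute_slots(kernel: List[Row]) -> int:
    """Count EMPTY computational slots in one copy of the kernel."""
    return sum(1 for _, comp in kernel if comp is None)


def unroll_and_fill(kernel: List[Row],
                    reducible: List[str]) -> Tuple[int, List[Row]]:
    """Unroll the kernel mkuf times and place each reducible instruction once."""
    free = empty_compute_slots(kernel)
    if free == 0:
        raise ValueError("no empty slot; mii_without would need to be incremented")
    mkuf = (len(reducible) + free - 1) // free  # ceil(required / available)
    pending = list(reducible)
    unrolled: List[Row] = []
    for _ in range(mkuf):
        for mem, comp in kernel:
            if comp is None and pending:
                comp = pending.pop(0)
            unrolled.append((mem, comp))
    return mkuf, unrolled


mkuf, unrolled = unroll_and_fill(KERNEL, REDUCIBLE)
```

With one empty slot per kernel copy and five reducible instructions, the kernel is unrolled five times and every empty slot is filled, with the branch landing in the last row of the last copy.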
Additional Considerations 

While the above describes the presently preferred 
embodiment, those skilled in the art will recognize that there 
are available other variations of the process for reducing the 
scheduling of loop-overhead or similar instructions. For 
example, the process of adjusting the values in the non- 
reducible instructions used in the preferred embodiment and 
described above may be described more generally as fol- 
lows: 

Assume that the first load (copy 1) is of the form:

ld[A]

and its stride is s. Then we know that the i'th load should
be of the form:

ld[A+(i-1)*s]

(That is, the sequence of load addresses should be of the
form: A, A+s, A+2s, . . . ). Now if we place a reducible
instruction of the form:

add A,d,A

after the j'th copy of a load, then we can adjust the
displacements as follows:

for copies 1 through j of the load, no adjustment is
required

for copies j+1 through last (SC-1+KUF) we simply
subtract d from the displacement. For example:

ld[A]
ld[A+s]
ld[A+2*s]
...
ld[A+(j-1)*s]
Reducible instruction "add A,d,A" is say placed here
ld[A+j*s]
ld[A+(j+1)*s]
...
ld[A+k*s]

When the reducible instruction is placed as shown above, the
displacements are adjusted as follows:

ld[A]
ld[A+s]
ld[A+2*s]
...
ld[A+(j-1)*s]
add A,d,A
ld[A+j*s-d]
ld[A+(j+1)*s-d]
...
ld[A+k*s-d]
The above step can be done more than once as one places
multiple copies of the reducible instructions in the prologue, 
kernel and epilogue. 
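A short simulation shows why subtracting d from the later displacements preserves the load addresses. The sketch below "executes" a run of loads with `add A,d,A` inserted after copy j and records the effective address of each load. The function name and the numeric values are illustrative assumptions.

```python
from typing import List


def effective_addresses(a: int, s: int, n_copies: int, j: int, d: int) -> List[int]:
    """Effective address of each load copy when 'add A,d,A' follows copy j
    and copies j+1..n have d subtracted from their displacement."""
    reg = a                      # current value of register A
    out = []
    for i in range(1, n_copies + 1):
        disp = (i - 1) * s       # original displacement of the i'th copy
        if i > j:
            disp -= d            # adjustment for copies after the add
        out.append(reg + disp)
        if i == j:
            reg += d             # the inserted reducible instruction executes here
    return out


# Six loads with stride 8, the add placed after the third copy.
expected = [1000 + (i - 1) * 8 for i in range(1, 7)]
adjusted = effective_addresses(a=1000, s=8, n_copies=6, j=3, d=24)
```

The addresses match the unadjusted sequence A, A+s, A+2s, ... for any choice of j and d, since the d added to the register is cancelled by the d subtracted from each later displacement.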

Similarly, an alternative embodiment of the invention 
could include the following steps or other variations thereof: 

1. Partition the nodes (instructions) into reducible and 
non-reducible as in the preferred embodiment above. 

2. Do not bother to compute values for mkuf due to the 
reducible instructions or increase mii_without for the same 
reason, prior to scheduling. 






3. Schedule the non-reducible instructions as in the pre- 
ferred embodiment and place a branch slot in the last group 
of instructions scheduled. 

4. Scan the resulting schedule and look for empty slots 
available for inserting reducible instructions if required. 

5. If there are reducible slots available, at this time 
compute the mkuf and then generate the schedule with the 
potentially increased kuf and then insert the reducible 
instructions. 

6. If there are no reducible slots available (which should 
not happen if the branch slot is placed early in the schedule) 
then the original schedule can be discarded and the reducible 
instructions are placed in between some groups of non- 
reducible instructions. 

This variation of the invention has the potential to reduce
the trouble required to place the reducible instructions
properly and to reduce the effective iteration interval (II).

It will be appreciated by those skilled in the art that 
various modifications and alterations may be made in the 
preferred embodiment disclosed herein without departing 
from the scope of the invention. Accordingly, the scope of 
the invention is not to be limited to the particular invention 
embodiments discussed above, but should be defined only 
by the claims set forth below and equivalents thereof. 

What is claimed is: 

1. A computer system having a central processing unit 
(CPU) and random access memory (RAM) coupled to said 
CPU, for use in compiling a target program to run on a target 
computer architecture having a plurality of parallel compu- 
tation units which facilitate instruction pipelining and which 
provides an ability to add two values to form an address used 
in a memory load or store instruction and which permits two 
or more instructions to be issued in a single clock cycle, said 
computer system comprising: 

a compiler system resident in said computer system 
having a front end compiler, a code optimizer and a 
back end code generator; and 

an instruction partition mechanism coupled to said code 
optimizer configured to partition instructions for the 
target program into reducible instructions and non- 
reducible instructions; 

a modulo scheduler mechanism coupled to said code 
optimizer configured to modulo schedule said non- 
reducible instructions; 

an instruction insertion mechanism configured to directly 
insert a copy of one of said reducible instructions into 
a modulo schedule of said non-reducible instructions 
which is produced by said modulo scheduler mecha- 
nism; and 

an instruction modification mechanism coupled to said 
code optimizer configured to identify one or more of 
scheduled non-reducible instructions which would nor- 
mally use a value produced by a designated reducible 
instruction and said instruction modification mechanism
further configured to modify an original offset in
an address portion of said identified one or more of 
scheduled non-reducible instructions which use a des- 
ignated reducible instruction. 

2. The computer system of claim 1 wherein the modulo 
scheduler mechanism coupled to said code optimizer con- 
figured to modulo schedule said non-reducible instructions 
is further configured to produce a modulo schedule having 
a prologue section, a kernel section and an epilogue section. 

3. The computer system of claim 1 wherein the instruction 
insertion mechanism configured to directly insert said copy 
of said reducible instructions into a modulo schedule of said 
non-reducible instructions which is produced by said 
modulo scheduler mechanism will insert no more than three
copies of a designated reducible instruction into said modulo 
scheduled reducible instructions. 

4. The computer system of claim 2 wherein the instruction 
insertion mechanism configured to directly insert said copy 
of said reducible instructions into a modulo schedule of said 
non-reducible instructions which is produced by said 
modulo scheduler mechanism will insert no more than one 
copy of a designated reducible instruction into each said 
modulo scheduled prologue, kernel and epilogue sections. 

5. An apparatus for optimizing the execution time of 
executable instructions in a target program which is desig- 
nated to run on a target computer architecture having a 
plurality of parallel computation units which facilitate 
instruction pipelining and which provides an ability to add 
two values to form an address used in a memory load or store 
instruction and which permits two or more instructions to be 
issued in a single clock cycle, said apparatus comprising: 

a computer having a processor, a memory, and an input/ 
output section; 

a compiler system resident in said computer memory 
having a front end compiler, a code optimizer and a 
back end code generator; and 

an instruction partition mechanism coupled to said com- 
puter for use by said code optimizer to partition instruc- 
tions for the target program into reducible instructions 
and non-reducible instructions; 

a modulo scheduler mechanism coupled to said computer 
for use by said code optimizer to modulo schedule said 
non-reducible instructions; 

an instruction insertion mechanism configured to directly 
insert said reducible instructions into a modulo sched- 
ule of said non-reducible instructions which is pro- 
duced by said modulo scheduler mechanism; and 

an instruction modification mechanism coupled to said 
computer for use by said code optimizer configured to 
identify one or more of scheduled non-reducible 
instructions which would normally use a value pro- 
duced by a designated reducible instruction and said 
instruction modification mechanism further configured 
to modify an original offset in an address portion of said
identified one or more of scheduled non-reducible 
instructions which use a designated reducible instruc- 
tion. 

6. A code optimizer for use in a compiler system for
compiling a target program to run on a target computer 
architecture having a plurality of parallel computation units 
which facilitate instruction pipelining and which provides an 
ability to add two values to form an address used in a 
memory load or store instruction and which permits two or 
more instructions to be issued in a single clock cycle, said 
code optimizer comprising: 

a first portion configured to accept as input an interme- 
diate code representation of said target program; 

a second portion, coupled to said first portion, configured 
to partition instructions for the target program into 
reducible instructions and non-reducible instructions; 

a third portion, coupled to said second portion configured 
to modulo schedule said non-reducible instructions; 

a fourth portion, coupled to said third portion configured 
to directly insert copies of said reducible instructions 
into a modulo schedule of said non-reducible instruc- 
tions which is produced by said third portion; and 

a fifth portion, coupled to said fourth portion configured 
to identify one or more of scheduled non-reducible 
instructions which would normally use a value produced
by a designated reducible instruction and said
fifth portion further configured to modify an original
offset in an address portion of said identified one or
more of scheduled non-reducible instructions which
use a designated reducible instruction, thereby producing
a schedule of the executable instructions for the
target program.

7. A computer controlled method of scheduling the 
executable instructions of a target program directed at a 
target computer architecture having a plurality of parallel 
computation units which facilitate instruction pipelining and 
which provides an ability to add two values to form an 
address used in a memory load or store instruction and 
which permits two or more instructions to be issued in a
single clock cycle, the schedule produced in a manner that
reduces the number of executable instructions required in
the schedule, said method comprising the steps of:

partitioning target program instructions to be compiled
into a set of reducible instructions and a set of
non-reducible instructions;

modulo scheduling the set of non-reducible instructions;

directly inserting a copy of each reducible instruction into
a schedule of the non-reducible instructions;

for a copy of a designated reducible instruction inserted
into the schedule, identifying all scheduled non-reducible
instructions which use the designated reducible
instruction; and

modifying the original offset value of the address identifier
of an identified non-reducible instruction which
uses a designated reducible instruction, thereby producing
a schedule of executable instructions for the
target program which contains a minimum number of
copies of each reducible instruction.

8. The method of claim 7 wherein the step of directly
inserting a copy of each reducible instruction into a schedule
of the non-reducible instructions inserts at most one copy of
the reducible instruction in each of the prologue, kernel and
epilogue sections of the schedule.

9. The method of claim 8 wherein the step of directly
inserting a copy of each reducible instruction into a schedule
of the non-reducible instructions is performed by finding a
vacant slot in the schedule, inserting the copy of the reducible
instruction in the vacant slot, and setting the displacement
of the copy to a new value which corresponds to a
function of the original increment in the reducible instruction
and the location of the copy in the schedule.

10. The method of claim 9 wherein the displacement of
the copy is set to a new value which is equal to the value
MaxUseInPrologue times OriginalDisplacement if the copy
is located in the prologue of the schedule.

11. The method of claim 9 wherein the displacement of
the copy is set to a new value which is equal to the value
"KUF times OriginalDisplacement" if the copy is located in
the kernel of the schedule.

12. The method of claim 9 wherein the displacement of
the copy is set to a new value which is equal to the value
"(StageCount-1-MaxUseInPrologue) times OriginalDisplacement"
if the copy is located in the epilogue of the
schedule.

13. The method of claim 7 wherein the step of modifying
the original offset value of the address identifier of an
identified non-reducible instruction which uses a designated
reducible instruction, further comprises the steps of:

determining which section of the schedule the using
non-reducible instruction is in, the sections designated
as the prologue, kernel and epilogue sections of the
schedule;

determining whether the copy of the reducible instruction
which is used by the identified non-reducible instruction
is placed in the schedule before or after the using
non-reducible instruction; and

adjusting the original offset value in the using
non-reducible instruction by a value which is a function of
the section of the schedule the using non-reducible
instruction is in, and whether the using non-reducible
instruction is located before or after a copy of the
reducible instruction in the schedule.

14. A computer program product comprising: 

a computer usable medium having computer readable 
program code mechanisms embodied therein to sched- 
ule the executable instructions of a target program 
directed at a target computer architecture having a 
plurality of parallel computation units which facilitate 
instruction pipelining and which provides an ability to 
add two values to form an address used in a memory 
load or store instruction and which permits two or more 
instructions to be issued in a single clock cycle, the 
schedule produced in a manner that reduces the number
of executable instructions required in the schedule, the
computer readable program code mechanisms in said
computer program product comprising:

computer readable code mechanisms to cause a computer
to partition instructions for a loop in the target program
into reducible instructions and non-reducible
instructions;

computer readable code mechanisms to cause the
computer to modulo schedule said non-reducible
instructions; and

computer readable code mechanisms to cause the
computer to directly insert said reducible instructions into a
modulo schedule of said non-reducible instructions
which is produced by said modulo scheduler
mechanism; and

computer readable code mechanisms to cause the
computer to identify one or more of scheduled
non-reducible instructions which would normally use a
value produced by a designated reducible instruction
and said instruction modification mechanism further to
modify an original offset in an address portion of said
identified one or more of scheduled non-reducible
instructions which use a designated reducible
instruction.
15. The computer system of claim 1 wherein a reducible
instruction is defined as an instruction which is an integral
self-increment or self-decrement of an induction variable,
wherein the induction variable is incremented and
decremented by a compile-time known integer constant and
wherein all uses of said instruction can be modified to have 
an immediate displacement representing the computed result 
of the reducible instruction. 

16. The apparatus of claim 5 wherein a reducible instruc- 
tion is defined as an instruction which is an integral self- 
increment or self-decrement of an induction variable, 
wherein the induction variable is incremented and
decremented by a compile-time known integer constant and
wherein all uses of said instruction can be modified to have 
an immediate displacement representing the computed result 
of the reducible instruction. 

17. The code optimizer of claim 6 wherein a reducible 
instruction is defined as an instruction which is an integral 
self-increment or self-decrement of an induction variable,
wherein the induction variable is incremented and decre- 
mented by a compile-time known integer constant and 
wherein all uses of said instruction can be modified to have 
an immediate displacement representing the computed result 
of the reducible instruction. 

18. The method of claim 7 wherein a reducible instruction 
is defined as an instruction which is an integral self- 
increment or self-decrement of an induction variable, 
wherein the induction variable is incremented and decre- 
mented by a compile-time known integer constant and 
wherein all uses of said instruction can be modified to have 
an immediate displacement representing the computed result 
of the reducible instruction. 

19. The computer program product of claim 14 wherein a 
reducible instruction is defined as an instruction which is an 
integral self-increment or self-decrement of an induction 
variable, wherein the induction variable is incremented and 
decremented by a compile-time known integer constant and 
wherein all uses of said instruction can be modified to have 
an immediate displacement representing the computed result 
of the reducible instruction. 

***** 